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Abstract 

Copy number variations (CNVs) are abundant in the human genome. They have been associated with complex 
traits in genome-wide association studies (GWAS) and expected to continue playing an important role in identifying the 
etiology of disease phenotypes. As a result of current high throughput whole-genome single-nucleotide polymorphism 
(SNP) arrays, we currently have datasets that simultaneously have integer copy numbers in CNV regions as well as SNP 
genotypes. At the same time, haplotypes that have been shown to offer advantages over genotypes in identifying 
disease traits even though available for SNP genotypes are largely not available for CNV/SNP data due to insufficient 
computational tools. We introduce a new framework for inferring haplotypes in CNV/SNP data using a sequential 
Monte Carlo sampling scheme Tree-Based Deterministic Sampling CNV (TDSCNV). We compare our method with 
polyHap(v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets 
of varying number of markers. We have found that both algorithms show similar accuracy but TDSCNV is an order 
of magnitude faster while scaling linearly with the number of markers and number of individuals and thus could 
be the method of choice for haplotype inference in such datasets. Our method is implemented in the TDSCNV 
package which is available for download at www.ee.columbia.edu/~anastas/tdscnv. 



Introduction 

Copy number variations (CNVs) are a form of a structural 
genomic variation referring to duplications and deletions 
of DNA segments larger than 1 kilobase in size. CNVs are 
abundant in the human genome, and it is estimated that 
they can occupy as much as 4% to 6%. 

Recently, large-scale genome-wide studies have shed 
light in many aspects and characteristics of CNVs pro- 
viding unique insights into the origins, mechanisms, 
formation, and population genetics of CNVs [1-3]. At 
the same time, CNVs have been associated with complex 
traits unexplained by recent genome wide association 
studies (GWAS) [2] and are believed to make a substantial 
contribution in uncovering the mechanisms and etiology 
of disease phenotypes that result from complex patterns 
of inheritance [2,4]. 

A variety of techniques exist for CNV detection. Initially, 
experimental studies have been performed primarily 
by array CGH, but lately due to improved resolution 
and genome coverage of genotyping arrays, a number 
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of methods have been developed relying on whole-genome 
single-nucleotide polymorphism (SNP) genotyping arrays 
which offer a more sensitive approach and are more suit- 
able for high-resolution CNV detection. As a result, there 
is currently simultaneously information on the integer 
copy number (CN) genotypes along a CNV region and on 
SNPs outside these regions, in which we will refer in the 
following as CNV-SNP genotypes. 

For diploid organisms, theoretical and empirical argu- 
ments have been made for the use of haplotypes as opposed 
to genotypes. It has been shown that the study of haplo- 
types can improve the power of detecting associations 
with diseases, and a variety of methods exist in the lit- 
erature that use haplotypes to detect causal relationships 
between a genetic region and a phenotype. Furthermore, 
haplotypes enable unique insides in the study of popula- 
tions and are required for many population genetics 
analyses. Specifically, methods for inferring selection 
[5] for studying recombination [6,7] as well as historical 
migration [8,9] build their subsequent analysis on existing 
haplotype data. 

The statistical determination of haplotype phase from 
genotype data is thus potentially very valuable if the estima- 
tion can be done accurately and has received an increasing 
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amount of attention over recent years. A number of 
well-known algorithms have been developed based on 
coalescent theory [10], imperfect phylogeny [11], Mar- 
kov chain Monte Carlo [10,12], Gibbs sampler [13], 
hidden Markov models [14], expectation minimization 
algorithm [15], etc. However, only recently, this problem 
has drawn attention when haplotypes are inferred in a 
CNV-SNP region. 

If we focus within a specific CNV region in a sample 
of individuals and assume that the ploidy is fixed for 
each individual along the region, then the problem of 
inferring the haplotypes is identical to the problem of 
inferring the haplotypes in polyploid organisms or esti- 
mating haplotypes from pooling data. A number of 
algorithms have been proposed for frequency estimation 
and inference on these settings, and not surprisingly, 
many have been applied to the associated CNV haplotype 
inference problem described above. 

Apart from the previous scenarios, a number of meth- 
odologies have been specifically developed and tailored 
for CNV data. Kato et al. [16] have developed a method- 
ology MOCSphaser based on the EM algorithm to assign 
copy numbers in their respective chromosomes in re- 
gions that include CN and SNPs. A core limitation of 
MOCSphaser as described above is that it takes into 
consideration only the total CN and not the alleles 
themselves, assigning on each chromosome a raw CN. 
As a consequence, even though it provides information 
about the total copies on a chromosome that could be 
potentially useful, it does not provide information on 
the diplotypes themselves. 

Another algorithm recently proposed by Kato et al. 
[17], CNVphaser uses an EM approach to perform infer- 
ence. The core limitation of that method is that the in- 
ference is performed within a CNV region and that the 
ploidy is considered fixed for an individual within the 
region. To address these problems and thus enabling the 
phasing of regions where the ploidy of an individual varies 
along the region and each individual can have different 
breakpoints, Su et al. [18] suggested polyHap(v2.0) in 
which they extended the functionality of their original 
methodology for pooling data [19]. In their study, they 
discern the phasing within a CNV into non-internal 
phasing in which the CNV in a chromosome is inferred 
as a diplotype and internal phasing in which the specific 
haplotypes comprising the CNV in a chromosome are 
further identified. We will use these definitions in our 
current work. 

In their algorithm, Su et al. use an HMM methodology 
that has separate emission states for the internal and 
non-internal phasing. They treat the transition between 
states conceptually in a hierarchical two-level model where 
the first level is for the transition among CN states and the 
second for the transition among the haplotype states given 



the CN states. polyHap(v2.0) is the only currently available 
method that can phase complex CNV regions by allowing 
arbitrary changes of CN within individuals and along the 
genomic sequence. 

In this paper, we propose a related new sequential 
Monte Carlo algorithm for haplotype phasing of CNV- 
SNP data. In our method, samples are processed 
sequentially and our method scales linearly with the 
number of samples as well as the number of individ- 
uals. We demonstrate that using our methodology, we 
can achieve state-of-the-art performance while our 
method is an order of magnitude faster than polyHap 
(v2.0). 

Methods 

The structure of this section is as follows. In the beginning 
of the section, we introduce some notation that we will use 
throughout the remaining manuscript. In the subsections 
that follow, we present the modified version of our TDS 
methodology for the case of CNV-SNP data. For complete- 
ness, we develop again our framework in detail as presented 
in [20,21]. We first present some modeling results for the 
prior and posterior distributions for the population haplo- 
type frequencies given the observed data. We then present 
the TDS methodology for the cases of known population 
frequencies and subsequently extend it to the case of 
unknown frequencies. In the derivation of the later, we 
use the previously derived results for the prior and 
posterior distributions for the haplotype frequencies. 
We end the exposition of our method by deriving the 
state update equations for the 'Tree-Based Deterministic 
Sampling CNV (TDSCNV) estimator and presenting 
the modified partition-ligation procedure adjusted for 
the CNV-SNP dataset scenario. In the end of the section, 
we describe the procedure for creating the datasets which 
we have used in the 'Results' section to evaluate our 
methodology. 

Definitions and notation 

Suppose we are given a set of CNV-SNP genotypes on L 
diallelic loci. We denote the two alleles at each locus by 

0 and 1. In the following, we will use the counts of allele 

1 as the provided measurement for each allele on each 
sample. In our method, we allow in a specific position a 
single amplification or deletion. Therefore, if we are within 
a CNV region in a chromosome, the allele counts could 
range from 0 to 2 but could range from 0 to 1 outside 
these regions. 

Suppose that we have T individuals and we denote c t = 
{cj, c^} to be the observed genotype of the £-th sample 
where cje{0, 1, 2, 3,4} are the observed counts on the 
ith. position. Suppose also that C t = {c 1} c t } is a set of 
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individuals up to and including individual t and let C 
denote the full set of individuals. 

In terms of haplotypes, we make an initial distinction 
in the values that alleles take in internal and non- 
internal phasing. The framework that follows however 
will be described generically and will be the same in 
both cases. 

For non-internal phasing, our purpose is to infer haplo- 
typic phase on diploid chromosomes as we are interested 
in the total copies of an allele at a specific position on a 
chromosome. Therefore, the possible values for an allele 
at each position are {-,0,1,01,00,11}. On the contrary for 
internal phasing, we infer haplotypic phase on polyploid 
chromosomes and the possible alleles at each position are 
{-,0,1}. 

For individual t, we denote the haplotypes occurring in 
that individual as h t . In the case of non-internal phasing, 
ht=iKi> hta)- For internal phasing, h t ={h t}1 , h Up }, 
where p is the ploidy of the organism, and p e {1, 2, 3, 4} 
as in our methodology, we only consider a single deletion 
or a single amplification. Therefore, for the case of 
non-internal phasing h tfl) h t>2 are strings of length L in 
which h tfif j e {-, 0, 1, 01, 00, 11} and for internal phasing, 
h tt i are strings of length L in which h t}i} j e {-, 0, 1}. 

We further denote H t = {h ly h t }> similarly to C t as the 
set of haplotypes for each individual up to and including 
individual t 

Let us also define z={z\, z M } as the set containing 
all haplotype vectors of length L that are consistent with 
any genotype in the set C. To obtain Z from the given 
dataset C, we first enumerate for each c t the subset = 
{/z*, hf } i = 1,...,T that contains all possible haplotype 
assignments which are consistent with c t . The set Z is then 
given simply as Z = Uf^if/j. A set of population haplotype 
frequencies 6 = {6i, 0 M } is also associated with the 
set Z of all possible haplotype vectors, where 0 m is the 
probability with which the haplotype z m occurs in the total 
population. We note here once again that we have given 
the definitions of Z and 0 generically for both internal and 
not internal phasing, respectively. 

Prior and posterior distribution for 6 

Assuming random mating in the population, it is clear 
that the number of each unique haplotype in H is drawn 
from a multinomial distribution based on the haplotype 
frequency 0 [22]. This leads us to the use of the Dirichlet 
distribution as the prior distribution for 0 so that 6 ~ D 
(Pi> Pm)- It is well known in Bayesian statistics 
that the Dirichlet distribution is the conjugate prior 
of the multinomial distribution. This implies in our 
case that if we assume that the prior distribution for 
6 is Dirichlet and we draw haplotypes based on their 
frequencies (multinomial distribution), then the posterior 



distribution for 6 is again a Dirichlet distribution. We 
prove this fact below. 

p(0\Ct,H t ,Z)«p(c t \h t = (h tjl ,...,h t , p ),0,C t - 1 ,H t - 1 )p 

x(h t = (h t ^... : h^)\e : c t _ l: H t _ 1: z) P (e\c t _ 1: H t _ l ) 
«p(h t = (h t ^...,h tiP )\e,z)p(e\c t -x,H t ^,z) 

«n^n^ ( '~ iH *n** 

i=l m=l m=l 

P P 

^D^ l (t-i) + ^l(z 1 -h u ),...,p M (t-l) + j2l(z M -h t , i )) 

i=l i=l 

where we denote p m {t) m = 1,...,M as the parameters of 
the distribution of 0 after the £-th pool and I(z m - h t)i ) is 
the indicator function which equals 1 when z m - h tfi is a 
vector of zeros, and 0 otherwise. We note here once 
again that the number of haplotypes (i.e., the index p in 
the assignment) depends on the phasing and is 2 for 
non- internal phasing while it ranges for internal phasing. 
Furthermore, in the previous calculations for 0, for each 
genotype vector, we only consider haplotype configura- 
tions that are consistent with that genotype. 

We have shown that the posterior distribution for 0 is 
also Dirichlet with parameters as given in (1) and depends 
only on the sufficient statistics, T t = \p m (t), l<m<M} 
which can be easily updated based on T t _ lt h t , c t as given 
by (1) i.e., T t = T t {T t _ lf h t) c t ). 

TDS estimator with known system parameters 6 

Similar to traditional sequential Monte Carlo (SMC) 
methods, we assume that by the time we have processed 
genotype c t _ 1} we have a set of K potential solution streams 

(commonly termed as particles) Hf\ (k= 1, K) each 

associated with its corresponding weight wf\ , as 

{{H^\ W ^k = l,...,K}. 

At point t-1, we approximate the real continuous distri- 
bution p(H t _ 1 1 C t _ i) as a discrete distribution as follows: 

iK^-ilQ^) =^j^ w f\l{H t _ l -Hf\) (2) 

K 

where W t -i = X^-i' 
k=i 

and / (•) is the indicator function such that I (x - y) = 1 
for x = y and I (x - y) = 0 otherwise. 

Processing the next individual t, we would like to 
make an online inference of the haplotypes H t based 
on the genotypes Q. From Bayes' theorem, we have 
p e {H t \C t ) ^p 0 (c t \H ti C t . 1 )p 0 (H t \C t . 1 ) «Po(ct\H t ,C t -i)p e (ht\ 
Ht~i, C t -i)p e {Ht-i\Ct-i) «p e (h t \H t - U C t -x)p Q (Ht-\\Ct-\) 
where for our purposes, we only consider haplotype 
assignments for individual t that are compatible to its 
observed genotype. 
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Assume further that there are lC xt such assignments. 
From previous relationships, if we knew the system pa- 
rameters 6, we would be able to approximate the distri- 
bution of pe(H t \C t ) as follows: 



-, K Kext / 

^(^|c t ) = ^EE^(^-KU ( (,) ]) 



t k=l i=l 



(3) 



where j//^,/z^l represents the vector obtained by 
appending the element to the vector Hf\ and 
Wf a = 22w[ kA with 

i,k 

wf' i] ^w ( ^ e {c t \h t = i)p e {h t = i\H { ^). 

TDS estimator with unknown system parameters 9 

However, the system parameters are not known. In our 
model, we use a Dirichlet distribution, as the prior for 6 
and as shown, we obtain a posterior distribution for 0 
(given H t and C t ) that is Dirichlet and only depends on a 
set of sufficient statistics. 

Using Bayes' theorem and similarly to the previous 
subsection, we have: 

p Q (H t \C u ZYp Q [c t \H u C t ^)p Q 

x (h t \H t -i , Q-i )p e (H t -i | Q_! , Z) « Pe (H t -i | Q_ x , Z) Pe 
x(c t \H tl C t -i) J p(h t \H t ^6,Z)p(6\T t ^Z)dd«p e ^ 

x (H t - X | Q_ l5 z) j p(h t \H t . 1 ,e,z) P (e\T t . 1 ,z)de 

where again we only consider haplotype assignments 
that are compatible with the observed genotype. 

Taking into consideration as argued before that if we 
know the system parameters 0, then the p{h t \H t _ 1} 0, Z) 
term represents sampling from a multinomial distribution 
and that the mean of the Dirichlet distribution with re- 
spect to an element 6 k of the vector 6 is as follows: 



E{0 k } 



Pk 

M 

£/>) 

;'=i 



we have from (4) that: 

p e {H t \C u Zy PQ {H t _ x \C t ^Z) J p(h t \H t ^,d,Z)p 

x(e\T t . u z)de-p(H t . 1 \c t . u z) / ([[oi 1 ) P 

J i=l 



<(6\ 
x(^-i| 



/M -I M 

B{p(t-l)+r) f 1 TT^.(w)+r r i,« 



x(H t -i\C t -i,Z) 



B(p(t-l)+r) 
B(p(t-1)) 



where r 



(zi-h t ,i),..., (z M -h t ,i) 



and B(p(t-1)) 



Assuming that we have approximated p(H t _ 1 \ C t _ i) as 
in (2), we can approximate p(H t \ C t ) using (5) as follows: 



-, K Kext 

r t (H t |C0=^EE^( H '-[ w{ «'(^'"-'^)] 



K Kext 

EE 

t k=l i=l 

where the weight update formula is given by: 



B(p«(t-l)+r) 



Wt Wt ~ l B(pW(t-l)) 



where again r 



E(^4).-.E(^4) 



(6) 



and 



(5) 



p u (t - 1) is the parameter vector of the assumed Dirichlet 
prior which represents how many times we have en- 
countered each haplotype in stream k in the solutions 
up to individual t-1. 



Partition-ligation 

In the partition phase, the dataset is divided into small 
segments of consecutive loci and each of the individual 
blocks is phased separately. To ligate the individual blocks, 
we have adjusted the original partition-ligation (PL) method 
for the case of CNV-SNP data. 

In our current implementation, to be able to derive all 
possible solution combinations for each pool genotype 
efficiently, we have decided to keep the maximum block 
length to 5 SNPs. Clearly, the more SNPs are included 
in a block, the more information about the LD patterns 
we can capture but at the same time, the number of pos- 
sible combinations increases and becomes prohibitive for 
more than 5 SNPs. For our experiments in a dataset with 
L loci, we have considered L/5 blocks of 5 consecutive loci 
and the remaining SNPs were treated as a separate block. 

The result of phasing for each block is a set of haplotype 
solutions for each genotype. Two neighboring blocks are 
ligated by creating merged solutions for each genotype 
from combinations of the block solutions, each associated 
with the product of the individual solution weights called 
the ligation weight 

Depending on which haplotypes one from each block 
are going to be assigned on the same chromosome for each 
individual, a different number of changes in the ploidy of that 
individual will occur. In our method, we consider only the 
assignments that will produce the minimum number of such 
changes. Therefore, if both haplotypes in any block have the 
same CN, we examine both alternative assignments but we 
otherwise ligate solutions that have the same CN. The TDS 
algorithm is then repeated in the same manner as it was for 
the individual blocks with the weights of the solutions scaled 
by the associated ligation weight for that solution. 
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Summary of the proposed algorithm 

Routine 1: 



Set the current number of solution streams m = 1. Define K as the maximum number of 
solution streams allowed. Define H^= {}. 

Find all possible haplotype assignments for each genotype and rearrange the genotypes in 
ascending order according to the number of distinct haplotype solutions each one of them 
has. 

For* =i,2,... 

o Find the K ext possible haplotype configurations compatible with the genotype of 

the t-th sample, 
o For£=7,2,...,m,y = l,...X xt . 

■ Enumerate all possible solution stream extensions 

*™ = hi P )}. 

■ Vy compute the weights w t ,J "' according to (6). 

o Select and preserve M— min (K, m • K 6 **) distinct sample streams {H^ , k = 
1,...,M} with the highest importance weights { wf* , k = 7,. . .,M} from the set 

{H^,w? 4 \ k = l,...,m,j = /,..., K™'}. 
o Update the number of counts of each encountered haplotype in each stream, 
o Set m = M. 



TDSCNV algorithm 

Partition the genotype dataset C into B subsets. 

For b = 1,...,B, apply Routine 1 so that all segments are phased, and for each one, keep 
all the solutions contained in the top ^particles. 
Until all blocks are ligated, repeat the following: 

o Find the blocks that if ligated would produce the minimum entropy. 

o Ligate the blocks, following the procedure described in the Partition-Ligation 
section. 



Dataset creation 

Our datasets consisted of SNPs from chromosomes 1 
and 2 from HapMap CEU population (HapMap3 re- 
lease 2 - phasing data). For our purposes, we have con- 
sidered only the parents in each trio which are the 
unrelated individuals in our dataset thus resulting in a 
total of 88 individuals. We have initially filtered out 
SNPs with minor allele frequencies less than 5%, and 
we have then considered non-overlapping datasets 
with a fixed number of SNPs. To create artificial CNV 
regions within each dataset, we have used the following 
procedure. 

First, in each dataset, we have found all the different 
haplotypes appearing in the dataset. In order to retain as 



much of the LD structure and also the property that 
most of the CNVs could be flagged by neighboring SNPs 
[2], we have randomly replaced specific areas of randomly 
chosen haplotypes with a CNV haplotype. To perform 
that procedure, we randomly selected haplotypes based on 
their frequency in the population and modified them 
inserting CNV regions sequentially as follows. Each pos- 
ition was considered as the beginning of a CNV region 
with a probability of 0.1. For each position flagging the 
beginning of a CNV, we assigned the length of the CNV 
region uniformly between three to eight SNPs. We then 
progressed along the haplotype from the end of the 
CNV region in a similar fashion until we reached the 
end of a given haplotype. 
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Table 1 Switch error rate Switch error rates for non-internal 
phasing 







Number of markers 






30 


50 


100 


TDSCNV 


0.115 


0.127 


0.14 


polyHap(v2.0) 


0.128 


0.135 


0.138 



The switch error rate presented for each number of markers is the average on 
1 00 datasets. 



Results 

Measurement of phasing accuracy 

We have used a number of different measures to evaluate 
the performance of our methodology. First, the switch 
error rate [23,24] is defined as the percentage of switches 
among all possible switches in haplotype orientation used 
to recover the correct phase in an individual 

In the case of a small number of loci where haplotype 
vectors can be expected to be reconstructed exactly, we 
have used two figures of merit namely the x 2 and l\ dis- 
tance to evaluate the accuracy of frequency estimation. 
Suppose that /are the predicted haplotype frequencies from 
an algorithm and g are the gold standard population level 
haplotype frequencies. The x 2 distance between the two 
distributions is simply the result of the x 2 statistic, i.e., 
d 

X 2 (fig) = ^ (frSi) I St where d is the number of gold 

i=l 

standard haplotypes whereas the l x distance between the 

d 

two distributions is defined as h(f,g) = ^^\frgi\ [25]. 



Switch error rate 

We have compared the performance of our method with 
polyHap(v2.0) for haplotypic phase inference using the 
switch error rate. In this section, the evaluation was done 
on non-internal haplotypes. In the evaluation of the switch 
error rate, we consider only CN and SNP positions that 
are ambiguous. For a marker genotype to have ambiguous 
phasing, there should be at least two alternative orienta- 
tion assignments. As an example, all 3CN genotypes are 
ambiguous positions. This is easy to see, as the choice 
alone of the chromosome that would have the duplication 
creates two distinct possible assignments. 

The performance of our method when considering the 
full set of individuals in each dataset is shown in Table 1. 
We have considered three marker sizes namely 30, 50, 
and 100 markers. For each marker size, we have simu- 
lated 100 datasets and the result presented is the average 
error rate on these 100 datasets. We can see that for 30 
and 50 markers, our method was marginally better than 
polyHap(v2.0), whereas for the 100 marker datasets, it 
was marginally worse. 

We further demonstrate the accuracy of our approach 
when ranging the number of individuals in each dataset. 
The results for a fixed number of 30 and 50 markers are 
shown in Figure 1. As expected, the performance for both 
methods improves with increasing number of individuals 
per dataset. 

Finally, we have broken down and calculated the switch 
error rates based on the CN of the 'from' and 'to' sites as 
shown in Table 2. Similarly, to Su et al, we observe the 





30 SNP Datasets 




50 SNP Datasets 


0.17 






0.175 
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—-X-— polyHap 




* 






— o- -TDSCNV 


0.17 












0.16 


□ 
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00 






o> 0.155 
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I 0.14 
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I 0.15 
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o 

W 0.145 




0.13 














\ 


0.14 
0.135 




0.12 








\ 






□ 


0.13 


□ 


0.11 










30 50 


88 


0.125 


30 50 88 




Number of Individuals 




Number of Individuals 


Figure 1 Switch error rate. Estimating the switch error rate for non-internal phasing on datasets having a varying number of individuals with 


polyHap(v2.0) and TDSCNV. 
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Table 2 Switch error rate for non-internal phasing 
according to the CN of the respective consecutive 
ambiguous markers 

CN on second site 
CN on first site 1 2 

1 0.117 0.227 

2 0.229 0.012 



Table 3 Timing results 

Number of markers 
30 50 100 

TDSCNV 2.1 3.7 5.7 

polyHap (v2.0) 262.3 431.5 892.1 

For each method and each marker size, the computational time is the average 
time on the 100 datasets used in the switch error rate calculation. Time is 
given in seconds. 



highest switch error rates appearing when the transitions 
happen between different CNs. 

Haplotype frequency estimation 

We have examined the accuracy of our method and 
compared it against polyHap(v2.0) on datasets of 8 
and 10 markers in which individuals had a fixed 
ploidy. We have evaluated two appropriate figures of 
merit as described above, the x 2 and l x distance. We 
should note here that in order to determine how 
good frequency estimations with a given method are, 
a small number of markers should be used. The rea- 
son is that for a large number of markers, it would 
be unlikely that the exact same haplotypes would ap- 
pear or reconstructed with appreciable frequency. The 
results for both figures of merit on an increasing num- 
ber of individuals are shown in Figure 2. Our method 
demonstrates superior performance for both figures of 
merit, and again as expected, both methods produced 
superior performance with an increasing number of 
individuals. 



Internal phasing 

We have further evaluated the performance of our method 
using the switch error rate inside duplicated regions. In this 
subsection, the evaluation was done on internal phasing 
and particularly in duplicated segments of a chromosome 
as the scope was to detect how good the specific haplotypes 
comprising the duplicated chromosomal region could be 
recovered. The switch error rate evaluation within such 
duplicated regions is exactly the same as the evaluation on 
a genotype with only SNPs. 

We have used the same 100 datasets for each of the 
three dataset sizes, namely 30, 50, and 100 markers, as in 
the evaluation of the switch error rate for non-internal 
phasing described in a previous subsection. We found, as 
expected, that the results were similar irrespectively of the 
dataset size, and the average across all datasets was 0.183. 

Timing results 

The computational times for the 30, 50, and 100 marker 
datasets used for the calculation of the switch error rate 
are displayed in Table 3. We can see that TDSCNV is an 
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order of magnitude faster than polyHap(v2.0) for all 
marker sizes examined. 

Discussion 

We present an algorithm for haplotypic inference in 
regions of CNV-SNP genotypes. We compare our 
method with polyHap(v2.0) on a variety of marker sizes 
and evaluate the accuracy and computational time of 
each method. Our method has similar accuracy to 
polyHap(v2.0) but is an order of magnitude faster in all 
datasets examined. 

In all instances of haplotype inference problems, it 
becomes increasingly significant that methods are able 
to incorporate prior knowledge in the form of haplo- 
types or genotypes from the same population as that 
from which the target samples were drawn. HapMap is a 
striking example of such database knowledge that could 
be used for haplotype inference. Furthermore, it is also 
important for researchers that samples that are phased 
at some point in time could be used efficiently for the 
phasing of samples presented at some later point. Our 
methodology offers a unique framework that can easily 
incorporate such prior knowledge. Haplotypes can be 
introduced in the form of a prior for the counts in the 
TDSCNV algorithm. From our experience with our 
framework and as expected, the presence of the extra 
information will improve the phasing accuracy of the 
target samples. 

Conclusions 

In this paper, we propose a new sequential Monte Carlo 
algorithm for haplotype phasing of CNV-SNP data. In 
our method, samples are processed sequentially and our 
method scales linearly with the number of samples as 
well as the number of individuals. 

To demonstrate the performance of our method, we 
have compared it against polyHap(v2.0), the only currently 
available software able to perform inference in CNV/SNP 
genotypes, on datasets of varying number of markers. We 
have initially compared the accuracy of both methods 
for haplotypic phase inference on non-internal haplo- 
types, on datasets of 30, 50, and 100 markers. We have 
then examined the accuracy of frequency estimation 
with both methods on datasets with a small number of 
markers (8 and 10 markers). Finally, we have evaluated 
the performance of our methodology inside duplicated 
regions for internal phasing. 

We have found that our method demonstrates com- 
parable or better accuracy than polyHap(v2.0) and at the 
same time is an order of magnitude faster in all datasets 
and marker sizes examined while scaling linearly with 
the number of markers and number of individuals. We 
therefore believe that our method could be the method 
of choice for haplotype inference in such datasets. 



Competing interests 

All authors declare they have no competing interests. 

Received: 8 December 2013 Accepted: 26 March 2014 
Published: 24 April 2014 



References 

1. DF Conrad, ME Hurles, The population genetics of structural variation. Nat 
Genet 39(7 Suppl), S30-S36 (2007) 

2. DF Conrad, D Pinto, R Redon, L Feuk, 0 Gokcumen, Y Zhang, J Aerts, TD 
Andrews, C Barnes, P Campbell, T Fitzgerald, M Hu, CH Ihm, K Kristiansson, 
DG Macarthur, RJ Macdonald, I Onyiah, AW Pang, S Robson, K Stirrups, A 
Valsesia, K Walter, J Wei, C Tyler-Smith, NP Carter, C Lee, SW Scherer, ME 
Hurles, The Wellcome Trust Case Control Consortium, Origins and functional 
impact of copy number variation in the human genome. Nature 
464(7289), 704-712 (2010) 

3. R Redon, S Ishikawa, KR Fitch, L Feuk, GH Perry, TD Andrews, H Fiegler, MH 
Shapero, AR Carson, W Chen, EK Cho, S Dallaire, JL Freeman, JR Gonzalez, M 
Grataoos, J Huang, D Kalaitzopoulos, D Komura, JR MacDonald, CR Marshall, 
R Mei, L Montgomery, K Nishimura, K Okamura, F Shen, MJ Somerville, J 
Tchinda, A Valsesia, C Woodwark, F Yang et al., Global variation in copy 
number in the human genome. Nature 444(71 18), 444-454 (2006) 

4. SA McCarroll, DM Altshuler, Copy-number variation and association studies 
of human disease. Nat Genet 39(7 Suppl), S37-S42 (2007) 

5. PC Sabeti, DE Reich, JM Higgins, HZ Levine, DJ Richter, SF Schaffner, SB 
Gabriel, JV Platko, NJ Patterson, GJ McDonald, HC Ackerman, SJ Campbell, D 
Altshuler, R Cooper, D Kwiatkowski, R Ward, ES Lander, Detecting recent 
positive selection in the human genome from haplotype structure. Nature 
419(6909), 832-837 (2002) 

6. P Fearnhead, P Donnelly, Estimating recombination rates from population 
genetic data. Genetics 159(3), 1299-1318 (2001) 

7. SR Myers, RC Griffiths, Bounds on the minimum number of recombination 
events in a sample history. Genetics 163(1), 375-394 (2003) 

8. M Bahlo, RC Griffiths, Inference from gene trees in a subdivided population. 
Theor Popul Biol 57(2), 79-95 (2000) 

9. P Beerli, J Felsenstein, Maximum likelihood estimation of a migration matrix 
and effective population sizes in n subpopulations by using a coalescent 
approach. Proc Natl Acad Sci U S A 98(8), 4563-4568 (2001) 

10. M Stephens, P Scheet, Accounting for decay of linkage disequilibrium in 
haplotype inference and missing-data imputation. Am J Hum Genet 
76(3), 449-462 (2005) 

11. E Halperin, E Eskin, Haplotype reconstruction from genotype data using 
Imperfect Phylogeny. Bioinformatics 20(12), 1842-1849 (2004) 

12. S Lin, A Chakravarti, DJ Cutler, Haplotype and missing data inference in 
nuclear families. Genome Res 14(8), 1624-1632 (2004) 

13. T Niu, ZS Qin, X Xu, JS Liu, Bayesian haplotype inference for multiple linked 
single-nucleotide polymorphisms. Am J Hum Genet 70(1), 157-169 (2002) 

14. SR Browning, Missing data imputation and haplotype phase inference for 
genome-wide association studies. Hum Genet 124(5), 439-450 (2008) 

15. ZS Qin, T Niu, JS Liu, Partition-ligation-expectation-maximization algorithm 
for haplotype inference with single-nucleotide polymorphisms. Am J Hum 
Genet 71(5), 1242-1247 (2002) 

16. M Kato, Y Nakamura, T Tsunoda, MOCSphaser: a haplotype inference tool 
from a mixture of copy number variation and single nucleotide 
polymorphism data. Bioinformatics 24(14), 1645-1646 (2008) 

1 7. M Kato, Y Nakamura, T Tsunoda, An algorithm for inferring complex 
haplotypes in a region of copy-number variation. Am J Hum Genet 
83(2), 157-169 (2008) 

18. SY Su, JE Asher, MR Jarvelin, P Froguel, Al Blakemore, DJ Balding, LJ Coin, 
Inferring combined CNV/SNP haplotypes from genotype data. 
Bioinformatics 26(11), 1437-1445 (2010) 

19. SY Su, J White, DJ Balding, LJ Coin, Inference of haplotypic phase and 
missing genotypes in polyploid organisms and variable copy number 
genomic regions. BMC Bioinform 9, 513 (2008) 

20. A Iliadis, D Anastassiou, X Wang, Fast and accurate haplotype frequency 
estimation for large haplotype vectors from pooled DNA data. BMC Genet 
13, 94 (2012) 

21. A Iliadis, J Watkinson, D Anastassiou, X Wang, A haplotype inference 
algorithm for trios based on deterministic sampling. BMC Genet 
11,78 (2010) 



Iliadis et al. EURASIP Journal on Bioinformatics and Systems Biology 2014, 2014:7 
http://bsb.eurasipjournals.com/content/201 4/1 II 



Page 9 of 9 



22. L Excoffier, M Slatkin, Maximum-likelihood estimation of molecular 
haplotype frequencies in a diploid population. Mol Biol Evol 
12(5), 921-927 (1995) 

23. S Lin, DJ Cutler, ME Zwick, A Chakravarti, Haplotype inference in random 
population samples. Am J Hum Genet 71(5), 1129-1137 (2002) 

24. J Marchini, D Cutler, N Patterson, M Stephens, E Eskin, E Halperin, S Lin, ZS 
Qin, HM Munro, GR Abecasis, P Donnelly, International HapMap Consortium, 
A comparison of phasing algorithms for trios and unrelated individuals. Am 
J Hum Genet 78(3), 437-450 (2006) 

25. B Kirkpatrick, CS Armendariz, RM Karp, E Halperin, HAPLOPOOL: improving 
haplotype frequency estimation through DNA pools and phylogenetic 
modeling. Bioinformatics 23(22), 3048-3055 (2007) 



doi:1 0.1 1 86/1 687-41 53-201 4-7 

Cite this article as: Iliadis et at A sequential Monte Carlo framework for 
haplotype inference in CNV/SNP genotype data. EURASIP Journal on 
Bioinformatics and Systems Biology 2014 2014:7. 



Submit your manuscript to a SpringerOpen 0 
journal and benefit from: 

► Convenient online submission 

► Rigorous peer review 

► Immediate publication on acceptance 

► Open access: articles freely available online 

► High visibility within the field 

► Retaining the copyright to your article 



Submit your next manuscript at ► springeropen.com 



